-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "ga4gh" namespace identifier registry and instructions/policy to TASC repo #32
Add "ga4gh" namespace identifier registry and instructions/policy to TASC repo #32
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in comments below, I recommend that a new repo be created for shared assets, config, etc., that the authoritative source for prefixes be machine-readable, and that the documentation be adjacent to that code. Therefore, this PR would move to a new repo.
|
||
To add an entry to the type prefix registry: | ||
* Fork the TASC repository in your own user space | ||
* In your user space, create a new branch from `ga4gh:master` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GA4GH should use a consistent default branch name across all projects. Most projects are migrating to using main
rather than master
for reasons discussed elsewhere. That is my preference too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree with this and certainly we should move what we can do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I will bring this up at TASC to rename it as main
for at least this repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jb-adams is there an open GH issue to track this so we can resolve this thread?
To add an entry to the type prefix registry: | ||
* Fork the TASC repository in your own user space | ||
* In your user space, create a new branch from `ga4gh:master` | ||
* In the new branch, add an additional row in the markdown table in `type-prefix-registry.md` with your proposed type prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the authoritative table should be code, preferably YAML, and not merely in documentation. That code should probably be in a new central/shared/core repo under TASC's purview, e.g., ga4gh-common
. PRs to add prefixes should be in that code. And, the documentation for those should be in the same repo, adjacent to the changes.
When I originally implemented prefixes for VRS, I anticipated separating GA4GH shared components from VRS. For this reason, the VRS repo has a separate file ga4gh.yaml.
So, the most straightforward action would be to move this file to another repo and VRS clients would include that (as a subrepo or with vendoring).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. the new repo would be the authoritative GA4GH namespace registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also strongly agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reece do you have any examples of how the registry would be used in a machine-readable sense? We discussed this at TASC and saw it mainly as human-readable documentation. i.e. developers of data platforms/systems would read through the registry and find GA4GH prefixes related to data models they are using, and then apply our identifier system to their objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jb-adams not that I want to speak for @reece or @larrybabb, but I do think the intent here is to be descriptive of the repo contents, as you suggest.
It might make sense to rename this repo to do so, but I think that there are several other TASC-related concerns that are tracked here; do you want to limit the scope of the TASC repo to concern itself only with the ga4gh-common
technical product?
I think @reece sees such a repo to eventually serve as a complete replacement for the GA4GH YAML in VRS in addition to other technical artifacts to be coordinated / blessed centrally by GA4GH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jb-adams : Apologies for missing the questions two weeks ago.
As @ahwagner suggested, please see the VRS repo. The file he pointed to (ga4gh.yaml) is used to build a map of type ⇒ prefix, which is then used during identifier assignment. That is, when an object of type Allele is serialized, the prefix for that identifier is specified in the file ga4gh.yaml. In an ideal world, this mapping would be stored in exactly one place and all code and documentation would be derived from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jb-adams : IMO, "TASC" is a group and doesn't help understand what this code is about. "config" or "globals" might be a better name. Admittedly, I know of only one file that would go there now.
Despite being a primarily Python guy these days, I do think that files of this sort should be language neutral.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re repository naming: This is trivia. It doesn't matter what repo this prefix registry is in; what matters is that it exists. This repository currently contains information about the TASC committee and the registry of Service Info types, and the PR proposes to add a registry of prefixes. Both of these registries are administered by the TASC committee, and IMHO this ga4gh/TASC repository is a convenient place to gather them.
The place that this information ought to be located in is a GA4GH technical web site. To date, GA4GH as an organisation has not provided a technical web site for the information produced by its working groups. The rumblings are that this is finally on the way to changing. In the meantime, this repository with the PR as proposed is as good a staging area as any; IMHO “keep it simple” applies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re maps of type ⇒ prefix: VRS's ga4gh.yaml file contains entries like Allele: VA
. But what does Allele
mean?
In the VRS context, it specifically means VRS's Allele
object as defined in your schema. That's good and concrete.
However in the (lack of) context of this general registry list, it's just a word. Different groups will inevitably mean slightly different things when they use the word. That's why IMHO this registry's entries need to consist of the reserved prefix (VA
), a prose description of the intention (“An allele, encompassing blah blah”), and links to its specific incarnations in individual GA4GH protocols (link to VRS's Allele
initially; eventually also links to other groups' protocols' uses of VA
).
I guess this is the crux of why I don't see this registry as a machine-readable list. What would the machine-readable entries contain? We are trying to make a list of registered prefixes like VA
(and collate their use in GA4GH formats and protocols); we are not trying to define an ontology of terms like Allele
. It's great to have a type/prefix map in a concrete context like VRS, but I don't see that it's feasible or useful for TASC to attempt in the general context.
|
||
The computed identifier scheme proposed in the VR Specification computes identifiers from the data itself. Because identifers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs. | ||
|
||
The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whilst this has origins within the computed identifier space, is there a reason to limit this to just computed identifiers? Unambiguous is the target right? So long as that can be achieved then no dramas I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i do think it would be limited only to computed identifiers.
However, if we were to allow one or more authorities to mint, maintain and manage curated identifiers then we would need some oversight to make sure each prefix was assigned to one and only one authority so as to avoid collissions.
I suppose it would be technically okay to do this, but there would need to be some oversight, guidelines and transparency on assuring certain prefixed namespaces are "owned" by one and only one authority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Computed identifiers would be my preference too, as curated identifiers already have a route via identifiers.org etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realistically, the way GA4GH-namespaced identifiers are managed is already tightly coupled to the associated specification, so I can see the argument to allow non-computed identifiers in this namespace. However, in the case of entities that have registered identifiers (e.g. DUO codes) it makes more sense to have a dedicated namespace for those. So I think it makes sense to limit the scope to computed identifiers as written here.
Regardless, I agree with @larrybabb that the key message here is for TASC to create and maintain oversight, guidelines, and transparency on identifier prefixes within the GA4GH namespace.
@@ -0,0 +1,10 @@ | |||
| Type Prefix | Class Name | Definition | Example | Reference | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be computationally consumable IMO. I don't have a concern with a textual representation but the primary source should be elsewhere (this might be covered by Reece's comment above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes machine readable-readable at the minimum, human-readable form may exist in sync with this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jb-adams see GA4GH YAML in VRS for an example of the computable resource we are seeking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments inline that I think need addressing
I think we may be considering embedding the version in the namespace as well? not sure yet, but this should be considered how versioning will be handled. |
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)| | ||
| VA | Allele | ... | `ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | | ||
| VH | Haplotype | ... | `ga4gh:VH:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | | ||
| VAB | Abundance | ... | `ga4gh:VAB:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we changed Abundance to CopyNumber and I think VAB should be VCN (Alex W can verify)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larrybabb is correct. Though in fairness we haven't updated the documentation on VRS latest to reflect this. Made an issue to address this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I wait until the VRS PR merges before I update it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please wait for 1.2.1; I will put in PRs this week so that we may merge and release soon.
|
||
The computed identifier scheme proposed in the VR Specification computes identifiers from the data itself. Because identifers depend on the data, groups that independently generate the same variation will generate the same computed identifier for that entity, thereby obviating centralized identifier systems and enabling identifiers to be used in isolated settings such as clinical labs. | ||
|
||
The computed identifier mechanism is broadly applicable and useful to the entire GA4GH ecosystem. Adopting a common identifier scheme will make interoperability of GA4GH entities more obvious to consumers, will enable the entire organization to share common entity definitions (such as sequence identifiers), and will enable all GA4GH products to share tooling that manipulate identified data. In short, it provides an important consistency within the GA4GH ecosystem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realistically, the way GA4GH-namespaced identifiers are managed is already tightly coupled to the associated specification, so I can see the argument to allow non-computed identifiers in this namespace. However, in the case of entities that have registered identifiers (e.g. DUO codes) it makes more sense to have a dedicated namespace for those. So I think it makes sense to limit the scope to computed identifiers as written here.
Regardless, I agree with @larrybabb that the key message here is for TASC to create and maintain oversight, guidelines, and transparency on identifier prefixes within the GA4GH namespace.
To add an entry to the type prefix registry: | ||
* Fork the TASC repository in your own user space | ||
* In your user space, create a new branch from `ga4gh:master` | ||
* In the new branch, add an additional row in the markdown table in `type-prefix-registry.md` with your proposed type prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also strongly agree.
|
||
## Proposal | ||
|
||
The following algorithmic processes, described in depth in the VR [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers) proposal, are included in this proposal by reference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following algorithmic processes, described in depth in the VR [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers) proposal, are included in this proposal by reference: | |
The following algorithmic processes, described in depth in VRS [Computed Identifiers](https://vrs.ga4gh.org/en/1.0/impl-guide/computed_identifiers.html#computed-identifiers), are included in this proposal by reference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahwagner this has been updated, can you confirm and resolve?
|
||
The `digest` is computed as described above. The type_prefix is a short alphanumeric code that corresponds to the type of object being represented. If this propsal is accepted, this “type prefix map” would be administered by GA4GH. (Currently, this map is maintained in a YAML file within the vr-spec repository, but it would be relocated on approval of this proposal.) | ||
|
||
We propose the following guidelines for type prefixes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We propose the following guidelines for type prefixes: | |
These are our recommendations for type prefixes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahwagner this has been updated, can you confirm and resolve?
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)| | ||
| VA | Allele | ... | `ga4gh:VA.EgHPXXhULTwoP4-ACfs-YCXaeUQJBjH_` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | | ||
| VH | Haplotype | ... | `ga4gh:VH:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | | ||
| VAB | Abundance | ... | `ga4gh:VAB:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@larrybabb is correct. Though in fairness we haven't updated the documentation on VRS latest to reflect this. Made an issue to address this here.
As I understand it, the primary purpose of this namespace identifier registry (in particular as it is written in this PR) is to provide a unified reference for spec authors listing the prefixes available and summarising what they mean. At least initially, IMHO machine processing would be a secondary concern; I would be interested to hear just what use cases people arguing for machine readability have in mind. On the basis of being a listing primarily for human readers, IMHO the PR is a pretty good first cut and starting point. Rather than format the list as a table, I would suggest a document with a section for each prefix. For each item, this would start with boilerplate giving the prefix, class name(s), link to a reference spec(s); and then have space for (up to) several paragraphs of explanation of what that kind of identifier represents. I have not yet reviewed the digest serialisation process, included from VRS by reference. On the face of it, I would be surprised if it were appropriate to limit all GA4GH namespace identifiers to this mechanism — but I am willing to be convinced. |
83db059
to
eb13691
Compare
I think the open issues boil down to:
Regarding (1), I +1 @jmarshall 's idea for making each type prefix in the human-readable document as its own section, each with a consistent set of fields and space for longer description. I'm supportive of creating a machine-readable YAML or JSON that stays 1:1 with the human-readable documentation. Regarding (2), my initial reaction is that we can allow non-computed identifiers within the registry, so long as we clearly indicate which prefixes use digest identifiers and which don't. But I think the VRS spec maintainers' opinions have more weight here, especially considering how they foresee the impact on downstream interoperability given either decision. |
@jb-adams I think that sounds about right. I also +1 the proposal by @jmarshall . I think my primary concern is that once defined, the prefixes are forever reserved across the GA4GH organization and must have a static definition. Therefore, anything that is adopted as a prefix must be considered permanent. In addition to providing a page to define these, we need a process to determine when a proposed prefix is mature enough to cast in stone. |
@@ -0,0 +1,10 @@ | |||
| Type Prefix | Class Name | Definition | Example | Reference | | |||
|-------------|------------|------------|---------|-----------| | |||
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| SQ | Sequence | ... | `ga4gh:SQ:...` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)| | |
| SQ | Sequence | ... | `ga4gh:SQ.…` | [VRS Computed Identifers](https://vrs.ga4gh.org/en/1.2.0/impl-guide/computed_identifiers.html)| |
(…and similarly throughout the table)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preferred three periods to the ellipsis character, but this probably wouldn't cause any issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three periods would obviously be confusing next to a literal period character, but either way the ellipsis is just a placeholder until the examples (if they are retained) are filled in. But the comment is a reminder that the :
should be a .
.
Co-authored-by: John Marshall <[email protected]>
This was discussed on today's TASC call. Some points that were touched on include:
We will be aiming to take this work forward with TASC, the VRS leads and any other interested parties and are hoping to schedule a call to focus on this topic. |
This is great, thanks for advancing this work within TASC. Happy to bring forward associated conversations happening in VR. Here are slides from today's VR call with a new versioning proposal, for reference in those future conversations. |
PR closed automatically because of the deletion of the master branch while changing the default branch to main. We will coordinate with the submitter to reopen this. |
This PR aims to create the documentation for using and contributing to the
ga4gh
namespace identifier and type prefixing system, as well as creating the initial type prefix registry itself.Background
,Proposal
, andType Prefixes
sections in the process document were taken from the Proposal for GA4GH-wide Computed Identifier Standard. These sections need to be further edited in my branch to make them less VRS-specificType Prefix Submission Process
section was adapted from @rishidev 's earlier PR on the same topic@ahwagner @rrfreimuth @mamanambiya @jmarshall @susanfairley